I searched online to find publicly available financial information. My search landed on this dataset, which contains credit default information. Aside from the dataset itself, the only information provided is the set of explanations attached to the features. As such, I'll treat this as an open-ended analysis to gain more knowledge of and familiarity with the information. Depending on the features, I may also proceed to create machine learning models and test their strength as predictors.
This publicly available dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("Credit_Card_dataset.csv") # load data into DataFrame
data.shape # get an idea of the dataset dimensions
data.isnull().values.any() # verify that DF has no missing values
The dimensions and the absence of missing values suggest this dataset has already been preprocessed. Rather than running a separate data integrity investigation, I'll make the EDA thorough enough to verify data cleanliness. My focus in this section is to better understand the provided information and identify important features.
data.head()
data.tail()
data.sample(5)
list(data)
data.describe()
print("Unknown: ", len(data[data.MARRIAGE == 0]), "|", round(len(data[data.MARRIAGE == 0])/len(data), 3))
print("Married: ", len(data[data.MARRIAGE == 1]), "|", round(len(data[data.MARRIAGE == 1])/len(data), 3))
print("Single: ", len(data[data.MARRIAGE == 2]), "|", round(len(data[data.MARRIAGE == 2])/len(data), 3))
print("Other: ", len(data[data.MARRIAGE == 3]), "|", round(len(data[data.MARRIAGE == 3])/len(data), 3))
print("Unknown 0: ", len(data[data.EDUCATION == 0]), "|", round(len(data[data.EDUCATION == 0])/len(data), 3))
print("Unknown 5: ", len(data[data.EDUCATION == 5]), "|", round(len(data[data.EDUCATION == 5])/len(data), 3))
print("Unknown 6: ", len(data[data.EDUCATION == 6]), "|", round(len(data[data.EDUCATION == 6])/len(data), 3))
Because the PAY columns are categorical codes and BILL_AMT / PAY_AMT are raw monetary amounts, the summary statistics above carry little significance for those columns.
There are a few strange findings here. As mentioned above, this dataset came with explanations for the features, yet some values are unlabeled. EDUCATION has a minimum of 0, which is undocumented, so we cannot tell whether it means an education level below high school or something else entirely. It should be treated the same as 5 and 6: unknown.
MARRIAGE is labeled for 1, 2, and 3, yet it also has a minimum of 0, another unlabeled category we cannot be certain about. Each of these unknowns makes up just over 1% of its column. They are worth noting, but probably not worth removing.
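Should those undocumented codes ever need consolidating, the documented "others" buckets would be the natural target. A hypothetical sketch on a miniature frame (the real dataset would be recoded the same way):

```python
import pandas as pd

# Hypothetical miniature frame carrying the undocumented codes
df = pd.DataFrame({"EDUCATION": [0, 1, 5, 6, 2], "MARRIAGE": [0, 1, 2, 3, 0]})

# Fold the undocumented EDUCATION codes (0, 5, 6) into 4 ("others")
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
# Fold the undocumented MARRIAGE code 0 into 3 ("others")
df["MARRIAGE"] = df["MARRIAGE"].replace({0: 3})
```

For this analysis the values stay as they are, but a recode like this would keep every category documented.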
Returning to the PAY columns: they are documented with a minimum of -1, yet -2 appears with no explanation. For the purpose of this exploration, I'm going to proceed with these values as is.
Lastly, I have no idea why PAY_0 is a feature name. It is almost certainly a typo, since it breaks the naming convention of the other PAY columns (which run PAY_2 through PAY_6 with no PAY_1). I'll rename it to avoid confusion past this point.
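The fix is a one-line rename, sketched here on a hypothetical miniature frame with the same column names:

```python
import pandas as pd

# Hypothetical miniature frame reproducing the mislabeled column
df = pd.DataFrame({"PAY_0": [-1, 0, 2], "PAY_2": [0, 0, 2]})
df = df.rename(columns={"PAY_0": "PAY_1"})  # align with the PAY_2..PAY_6 convention
```

In this notebook the same call would be applied to `data`, so the feature list later on can refer to PAY_1.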
list(data)
sns.pairplot(data)
Perhaps a bit much to take in at first glance, but there are some interesting patterns here. Let's focus on a few of these distributions while tying them back to the demographic statistics above.
plt.hist(data["LIMIT_BAL"], bins = "auto")
plt.title("Credit Limit Distribution", fontsize = 16)
plt.show()
plt.hist(data["SEX"])
plt.title("Gender Distribution", fontsize = 16)
plt.show()
plt.hist(data["AGE"])
plt.title("Age Distribution", fontsize = 16)
plt.show()
plt.hist(data["EDUCATION"])
plt.title("Education Distribution", fontsize = 16)
plt.show()
plt.hist(data["default.payment.next.month"])
plt.title("Default Next Month Distribution", fontsize = 16)
plt.show()
print("Default Next Month: ", len(data[data["default.payment.next.month"] == 1]), "|",
round(len(data[data["default.payment.next.month"] == 1])/len(data), 3))
print("Non-Default Next Month: ", len(data[data["default.payment.next.month"] == 0]), "|",
round(len(data[data["default.payment.next.month"] == 0])/len(data), 3))
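The count-and-ratio prints above can be condensed with value_counts; a sketch on a hypothetical stand-in series:

```python
import pandas as pd

# Hypothetical stand-in for the default.payment.next.month column
s = pd.Series([0, 0, 0, 1, 0])
counts = s.value_counts()                         # raw counts per class
shares = s.value_counts(normalize=True).round(3)  # class proportions
```

The same two calls on the real column reproduce every figure printed above in two lines.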
Defaulting is a major concern and something that needs further assessment. A correlation heatmap offers a quick overview, so I'll use one to view the relationship strengths between defaulting and the other features.
corr = data.corr()
mask = np.zeros_like(corr, dtype = bool) # np.bool is deprecated; use the builtin
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (20, 16))
sns.heatmap(corr, mask = mask, cmap = "RdBu", annot = True)
plt.yticks(rotation = 0)
Unfortunately, there are no strong correlations between the default column and the other features. It may still serve as a target, but more data exploration should be conducted before building models and applying machine learning.
As an additional note, the relationships between nearly all features are weak. The exceptions are the PAY and BILL_AMT columns, which correlate strongly with their own counterparts in other months. Even without formal statistics, that pattern holds up to common reasoning: a client's bill in one month is a good predictor of the next month's bill.
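To rank those relationships numerically rather than reading them off the heatmap, the absolute correlations against the target can be sorted. A sketch on synthetic data (the column names and data-generating rule are stand-ins, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: one informative column, one noise column, binary target
df = pd.DataFrame({
    "PAY_1": rng.integers(-1, 3, 500),
    "LIMIT_BAL": rng.integers(10_000, 500_000, 500),
})
df["default"] = (df["PAY_1"] > 1).astype(int)

# Absolute correlation with the target, strongest first
ranking = df.corr()["default"].drop("default").abs().sort_values(ascending=False)
```

On the real frame, the same one-liner against default.payment.next.month turns the heatmap's last row into a sorted list.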
This dataset offers different demographics and pairings that can be grouped and analyzed. It is worth focusing on certain segmentations and comparing them against one another. There may or may not be differences between the groups, but observing, understanding, and identifying driving factors is the primary goal here. To do this, I'll use notched boxplots as an efficient visual reference instead of raw numbers.
def boxplot_generator(category_a, category_b, category_c, plot_title, width):
    figure, ax1 = plt.subplots(ncols = 1, figsize = (width, 6))
    plt.title(plot_title, fontsize = 16)
    sns.boxplot(ax = ax1, x = category_a, y = category_b, hue = category_c,
                data = data, showfliers = False, notch = True, palette = "Spectral")
    plt.show()
boxplot_generator("SEX", "LIMIT_BAL", "SEX", "Credit Limit by Gender", 4)
The overlapping notches indicate no statistically significant difference in median credit limit between genders. Even with the larger female sample size, males show a slightly wider credit limit spread, while females have a slightly higher median. When considering credit limits, this population can reasonably be treated as a single group regardless of gender.
boxplot_generator("AGE", "LIMIT_BAL", "SEX", "Credit Limit Distribution by Age and Gender", 18)
print("Combined Ages below 25 and above 62: ", len(data[data.AGE < 25]) + len(data[data.AGE > 62]) , "|",
round((len(data[data.AGE < 25]) + len(data[data.AGE > 62]))/len(data), 3))
This more granular credit limit distribution by age and gender confirms the overlap identified above: with ages added to the view, the notched boxes remain incredibly close between genders. The pattern holds from ages 25 through 62, while the youngest and eldest ages show greater variance. These extremes make up just under 10% of the total population, so moving forward we have the option to omit them from model building or keep them. Since the variance is unlikely to drive serious issues, I'll elect to retain them for now, although if the models perform poorly I may retry with them excluded. This also depends on how important the credit limit feature turns out to be.
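Should the age extremes later need excluding, the filter is a single boolean mask; a sketch on a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical miniature frame with ages inside and outside the core range
df = pd.DataFrame({"AGE": [21, 30, 45, 63, 70]})
core = df[(df["AGE"] >= 25) & (df["AGE"] <= 62)]  # keep the 25-62 core
```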
After performing the EDA and getting a better feel for this data, the best use of the information appears to be developing machine learning models to predict credit defaults. The target will naturally be the default.payment.next.month column, with the rest of the data used as features. I'm going to exclude the ID field, as it carries no significance for the categories. I'll start with a decision tree for classification: decision trees are flexible and require little to no data preparation, which makes them a great place to start for an initial classifier.
list(data) # just listing to copy and paste target and features
# removing ID from feature list
target = "default.payment.next.month"
features = ["LIMIT_BAL",
"SEX",
"EDUCATION",
"MARRIAGE",
"AGE",
"PAY_1",
"PAY_2",
"PAY_3",
"PAY_4",
"PAY_5",
"PAY_6",
"BILL_AMT1",
"BILL_AMT2",
"BILL_AMT3",
"BILL_AMT4",
"BILL_AMT5",
"BILL_AMT6",
"PAY_AMT1",
"PAY_AMT2",
"PAY_AMT3",
"PAY_AMT4",
"PAY_AMT5",
"PAY_AMT6"]
X = data[features]
y = data[target]
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 12) # 70/30 train test split
dtc = DecisionTreeClassifier(max_depth = 8, random_state = 12)
dtc.fit(X_train, y_train)
preds = dtc.predict(X_test)
accuracy_score(y_test, preds)
Using all the features and an out-of-the-box decision tree classifier, we reached an 81.4% accuracy score. While this is definitely not strong enough for implementation purposes, it's a decent start. The next steps are twofold: look at the feature importances before rerunning the model, then work on some hyperparameter tuning to get the most out of it.
featdf = pd.DataFrame({"Feature": features, "Feature Importances": dtc.feature_importances_})
featdf = featdf.sort_values(by="Feature Importances", ascending = False)
featdf
plt.title("Feature Importances", fontsize = 18)
f = sns.barplot(x = "Feature", y = "Feature Importances", data = featdf, palette = "Spectral")
f.set_xticklabels(f.get_xticklabels(), rotation = 90)
plt.show()
PAY_1 comes in with an importance of 56%! That's quite a significant feature. Looking at the spread, the first two features account for an impressive share of the predictive power. I'm going to cut down the number of features, since the long tail may be contributing to overfitting. I'll focus on three and see if that alone improves the model.
Another consideration is how these importances relate to the EDA above. We identified that gender-based variances were minimal, and indeed the gender feature ranks dead last in predictive power.
target = "default.payment.next.month"
features = ["PAY_1", "PAY_2", "BILL_AMT1"] # limiting features to three
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 12)
dtc = DecisionTreeClassifier(max_depth = 8, random_state = 12)
dtc.fit(X_train, y_train)
preds = dtc.predict(X_test)
accuracy_score(y_test, preds)
Using the same model as above with just three features, the accuracy improved by 0.3%. This is not a significant change, but it does tell us we can use fewer features for this model moving forward. Since this is a one-off analysis, that doesn't carry much weight here, but in another situation, gathering data and developing similar models would be more streamlined, with less computation time spent on querying and training. This matters far more when datasets run to millions of rows instead of the 30,000 here.
I'll perform a grid search next for hyperparameter tuning. The goal is to push this model until we get the best performance out of it. If scores are still lacking, we can build a new model for comparison.
from sklearn.model_selection import GridSearchCV
param_grid = {"max_depth": np.arange(1, 12), "criterion": ["gini", "entropy"],
"min_samples_split": [2, 5, 10, 20], "max_leaf_nodes": [4, 12, 20, 40]}
dtc_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 6, scoring= "accuracy")
dtc_search.fit(X_train, y_train)
dtc_search.best_estimator_
dtc2 = DecisionTreeClassifier(criterion = "gini", max_depth = 3,
                              max_leaf_nodes = 12, min_samples_split = 2,
                              random_state = 12, splitter = "best")
dtc2.fit(X_train, y_train)
preds = dtc2.predict(X_test)
accuracy_score(y_test, preds)
Unfortunately, the hyperparameter tuning yields only a fraction of a percent over the earlier score. At this point, we can consider this model with these features exhausted. To clarify, there are still plenty of additional variations and parameters that could be tested, and these features remain vanilla: feature engineering could deliver significant performance gains. Best practice is to test more model types with the current features before turning to feature engineering. I won't go into feature engineering for this dataset, since the purpose of this EDA was to discover the story behind the information, which led us to creating predictive models to detect credit defaults.
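Testing more model types against the same three features could reuse one cross-validation loop. A minimal sketch with synthetic stand-in data (the candidate model list and the data-generating rule are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)
X = rng.normal(size=(300, 3))                              # stand-in for the three features
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

results = {}
for model in (DecisionTreeClassifier(max_depth=3, random_state=12),
              LogisticRegression()):
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[type(model).__name__] = scores.mean()          # mean accuracy across folds
```

Swapping in X_train and y_train from above would give a like-for-like comparison under the same cross-validation folds.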
Before concluding this EDA, let’s try a random forest classifier for predicting credit defaults. We’ll use the three features from above and start with a grid search for optimal parameters.
from sklearn.ensemble import RandomForestClassifier
param_grid = {"n_estimators": [100, 200, 400, 500], "criterion": ["entropy", "gini"], "class_weight" : ["balanced"]}
rfc_search = GridSearchCV(RandomForestClassifier(), param_grid, cv = 6, scoring = "accuracy")
rfc_search = rfc_search.fit(X_train, y_train)
rfc_search.best_estimator_
rfc = RandomForestClassifier(bootstrap = True, class_weight = "balanced",
                             criterion = "gini", n_estimators = 100,
                             random_state = 12, verbose = 1)
rfc.fit(X_train, y_train)
preds = rfc.predict(X_test)
accuracy_score(y_test, preds)
I was hoping to end this EDA with a higher-scoring model, but that just wasn't the case. Of the two classifiers, the decision tree showed noticeably better accuracy than the random forest. Still, accuracy in the low 80% range is weak for professional use, especially since always predicting "no default" would already score around 78% given the class balance. If applied as is, the model would still provide more insight than nothing at all, but it is far from optimal.
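One reason the low-80s accuracy reads weaker than it sounds is the class imbalance: with roughly 78% non-defaults, a classifier that always predicts "no default" is already close. A sketch on synthetic imbalanced labels makes this concrete:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(12)
y_true = rng.choice([0, 1], size=1000, p=[0.78, 0.22])  # ~22% defaults
y_naive = np.zeros(1000, dtype=int)                     # always predict "no default"

baseline = accuracy_score(y_true, y_naive)  # lands near 0.78
cm = confusion_matrix(y_true, y_naive)      # cm[1, 1] == 0: every default missed
```

This is why, in a follow-up, recall or a precision-recall trade-off on the default class would be a better yardstick than raw accuracy.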
There are two theoretical progressions from this point. We could go back to the data source, or investigate the labels to gain insight into the undocumented classifications. Alternatively, proceeding into feature engineering is the most realistic option: further studying the relationships in the data would help develop better variables and increase model accuracy and overall performance.